CzarniakMichelSedar GitHub Repository
This research is motivated by the interplay between socioeconomic factors and power plants. As the global energy landscape shifts toward sustainability, it becomes important to understand how these changes might affect communities, particularly lower-income ones. This study takes a socio-spatial approach to identify patterns of power plant distribution across North Carolina. With 100 counties and a total of 843 power plants consisting of 1190 generators (US EPA, 2021), the state offers a rich context for investigation. The overarching question is whether certain communities bear a disproportionate burden of environmentally impactful energy sources, a question that sits within the discourse on environmental justice and equity.
As a key aspect of this exploration, the research considers how power plant location and emissions relate to various socioeconomic indicators, including but not limited to unemployment rates. This is motivated by the awareness that power plants have often served as major employers in low-income areas (Union of Concerned Scientists, 2021). The research also extends beyond energy sources to examine the broader impacts of power plants on health and social vulnerability, and scrutinizes disparities in pollution exposure and associated health risks at a high level, using publicly available data. By drawing on a range of socioeconomic indicators, it contributes to a more nuanced understanding of the multifaceted impacts of power plants.
The following research questions are addressed:
1. How does income impact power plant characteristics at the county level?
2. Do power plant retirements have a significant impact on unemployment?
3. Does publicly available data show a relationship between power generation and social vulnerability, health vulnerability, or environmental burden?
Data used in this analysis comes from a variety of sources. A brief overview of each source is given below, and full citations can be found at the end of this report. More information on the metadata for each source can be found in the Metadata folder on GitHub.
The study uses the 2020 and 2021 eGRID databases provided by the U.S. Environmental Protection Agency to obtain essential power plant characteristics. This annually updated database offers comprehensive details on emissions, emission rates, generation, heat input, resource mix, location, and various other indicators. For the analyses in this research, the key data elements are the plant Federal Information Processing Standards (FIPS) code, generation, nameplate capacity, and emissions by plant.
To process this dataset, the plant and generator sheets are modified individually before being integrated. For the plant sheet, the columns relevant to the analyses are selected, an additional column is introduced to consolidate the state and county FIPS codes into a unified county FIPS code, and the dataset is filtered to North Carolina only. A parallel procedure is executed for the generator sheet before the two datasets are merged using the unique ORISPL code.
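The steps above can be sketched as follows. This is a minimal illustration in plain Python, not the project's own code: the rows and values are made up, and the column names (`PSTATABB`, `FIPSST`, `FIPSCNTY`, `PLNGENAN`, `NAMEPCAP`, `GENID`) are assumed to follow eGRID naming; only `ORISPL` is named in the text.

```python
# Made-up stand-ins for the eGRID plant and generator sheets (values are not real).
plants = [
    {"ORISPL": 1001, "PSTATABB": "NC", "FIPSST": 37, "FIPSCNTY": 1, "PLNGENAN": "1200.0"},
    {"ORISPL": 1002, "PSTATABB": "NC", "FIPSST": 37, "FIPSCNTY": 153, "PLNGENAN": "5400.0"},
    {"ORISPL": 2001, "PSTATABB": "VA", "FIPSST": 51, "FIPSCNTY": 1, "PLNGENAN": "800.0"},
]
generators = [
    {"ORISPL": 1001, "GENID": "1", "NAMEPCAP": "0.5"},
    {"ORISPL": 1002, "GENID": "A", "NAMEPCAP": "300.0"},
    {"ORISPL": 1002, "GENID": "B", "NAMEPCAP": "310.0"},
    {"ORISPL": 2001, "GENID": "1", "NAMEPCAP": "90.0"},
]


def make_county_fips(state_fips, county_code):
    """Combine a 2-digit state code and a 3-digit county code into a 5-digit county FIPS."""
    return str(state_fips).zfill(2) + str(county_code).zfill(3)


# Plant sheet: keep North Carolina, select columns, convert to numeric, add county FIPS.
nc_plants = {
    p["ORISPL"]: {
        "ORISPL": p["ORISPL"],
        "COUNTY_FIPS": make_county_fips(p["FIPSST"], p["FIPSCNTY"]),
        "PLNGENAN": float(p["PLNGENAN"]),
    }
    for p in plants
    if p["PSTATABB"] == "NC"
}

# Generator sheet: merge onto the plant rows using ORISPL as the key.
plant_gen_data = [
    {**nc_plants[g["ORISPL"]], "GENID": g["GENID"], "NAMEPCAP": float(g["NAMEPCAP"])}
    for g in generators
    if g["ORISPL"] in nc_plants
]

for row in plant_gen_data:
    print(row)
```

Zero-padding matters here: if FIPS codes are read as numbers, county `1` in state `37` must become `37001` for later joins on the county FIPS code to line up.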
It is important to highlight that although the most recent eGRID database available was from 2021, this version lacked historical generator retirement data. Consequently, the 2020 eGRID database is incorporated to facilitate the analysis of unemployment trends related to generator retirements.
To link power plants with socioeconomic factors, household income data is taken from data compiled by the Economic Research Service of the United States Department of Agriculture. This data includes metrics such as household income and poverty rates at the county level, identified by the Federal Information Processing Standards (FIPS) code. Preparing this dataset involves extracting the pertinent columns, specifically median household income, and then integrating them with the initial eGRID dataframe. The FIPS code serves as the common column linking the two datasets.
As an additional facet of investigation, the analysis incorporates unemployment data obtained from the Economic Research Service of the United States Department of Agriculture. The dataset, spanning the years 2000 through 2021, comprises comprehensive information on the total labor force, the number of employed individuals, the number of unemployed individuals, and the unemployment rate. The data wrangling process for this dataset unfolds in two distinct stages. Initially, the columns of interest, specifically those pertaining to the unemployment rate, are selected. Following this, the identified columns undergo appropriate conversions to their intended data types before being merged with the existing dataset, utilizing the county FIPS code as the linking identifier.
The subsequent phase of data wrangling introduces additional steps specifically tailored for conducting regression analyses between generator retirements and unemployment rates. To facilitate this analysis, the code executes a transposition of the data. This transformation presents North Carolina county unemployment rates by year in a more streamlined format, allocating a dedicated column for each year.
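The reshaping described above can be illustrated with a short sketch. It assumes, hypothetically, that the unemployment rates start as one record per county and year with rates stored as text; the values shown are placeholders, not figures from the USDA dataset.

```python
# Placeholder records: (county FIPS, year, unemployment rate as text).
long_rows = [
    ("37001", 2019, "3.6"),
    ("37001", 2020, "7.1"),
    ("37153", 2019, "5.0"),
    ("37153", 2020, "9.2"),
]

# One output column per year, in chronological order.
years = sorted({year for _, year, _ in long_rows})

# Build one row per county, converting each rate to a number along the way.
wide = {}
for fips, year, rate in long_rows:
    wide.setdefault(fips, {y: None for y in years})[year] = float(rate)

header = ["FIPS"] + [str(y) for y in years]
table = [[fips] + [wide[fips][y] for y in years] for fips in sorted(wide)]

print(header)
for row in table:
    print(row)
```

With one column per year, a county's unemployment rate around the year of a generator retirement can be looked up directly by column.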
The Environmental Justice Index (EJI) is drawn from a 2022 report published by the U.S. CDC on environmental burden, social vulnerability, and health vulnerability indicators. This work builds on the Environmental Protection Agency (EPA)’s EJSCREEN. The indicators (social vulnerability, environmental burden, health vulnerability, and the variables that contribute to each of them) were selected based on a literature review conducted between December 2020 and December 2021. The Environmental Justice Index describes itself as “the first national, place-based tool designed to measure the cumulative impacts of environmental burden through the lens of human health and health equity.”
After learning about the Social Vulnerability Index, we continued exploring the information available on the CDC’s Agency for Toxic Substances and Disease Registry site. Upon finding the EJI, we decided to test our hypotheses that generation and emissions from power plants would be correlated with increased environmental burden and health vulnerability in North Carolina in 2021. It is crucial to note that the EJI, like any tool produced at the national scale, has limitations inherent in the resolution at which it provides information. We believe it is important to highlight, from the EJI data documentation, that “injustice occurs locally. High-level tools such as the EJI cannot capture all social, environmental, or health issues that a community may face.” Details on limitations and considerations can be found on page 7 of the EJI data documentation in the Metadata folder on GitHub.
County-level visualizations use 2020 data from the Centers for Disease Control and Prevention/Agency for Toxic Substances and Disease Registry/Geospatial Research, Analysis, and Services Program together with the “cb_2018_us_county_20m” shapefile, a cartographic boundary file from the United States Census Bureau. 2018 was the most recent year available for Cartographic Boundary File downloads.
| Dataset | Attribute | Description |
|---|---|---|
| eGRID | Key Variables | Plant locations, production, capacity, retirements, and emissions |
| eGRID | Data Hierarchy | Generators are associated with plants; plants are associated with counties and FIPS codes |
| eGRID | Data Range | 2020 (dataset used only to reference historical retirement data) and 2021 (all other data) |
| eGRID | Data Source | The United States Environmental Protection Agency |
| Income | Key Variables | Median Household Income, FIPS codes |
| Income | Data Hierarchy | Income data is associated with FIPS county code |
| Income | Data Range | 2021 |
| Income | Data Source | The Economic Research Service of the United States Department of Agriculture |
| Unemployment | Key Variables | Unemployment rate, FIPS code |
| Unemployment | Data Hierarchy | Unemployment rates are associated with FIPS county code |
| Unemployment | Data Range | 2000 to 2021 |
| Unemployment | Data Source | The Economic Research Service of the United States Department of Agriculture |
| Social Vulnerability Index | Key Variables | Socioeconomic status, household characteristics, racial & ethnic minority status, and housing type & transportation |
| Social Vulnerability Index | Data Hierarchy | Indicator data is available at the tract level |
| Social Vulnerability Index | Data Range | 2020 |
| Social Vulnerability Index | Data Source | The United States Centers for Disease Control and Prevention/Agency for Toxic Substances and Disease Registry Social Vulnerability Index (CDC/ATSDR SVI) |
| Environmental Justice Index | Key Variables | Environmental burden, social vulnerability, and health vulnerability |
| Environmental Justice Index | Data Hierarchy | Indicator data is available at the tract level |
| Environmental Justice Index | Data Range | 2020 to 2021 |
| Environmental Justice Index | Data Source | The United States Agency for Toxic Substances and Disease Registry |
Once the data were loaded, the initial step in our research was data wrangling on the eGRID, unemployment, and income datasets. Data wrangling, the process of transforming raw data into a usable form, was a crucial step in this research, particularly due to the diverse nature of the datasets. For the eGRID dataset, the code selects specific columns related to power plant information in North Carolina, converts certain columns to numeric format, and then combines the generator and plant data based on a common identifier. Duplicate columns are removed, resulting in the plant_gen_data dataframe. The code then handles the unemployment dataset, selecting relevant columns covering the years 2002-2022 and merging them with the plant_gen_data dataframe using the FIPS code as a key. Similarly, the income dataset is processed by selecting specific columns and merging them with the previously created dataframe. The resulting dataset, named processed_data, contains information on power plants, unemployment rates, and income, and its column names are displayed at the end of the code.
The SVI dataset was available at the county level, so its wrangling was simpler than the process for the EJI, which follows. After import, SVI variables of interest were read into a data frame. Plant generation data was then grouped by county, summarized, and merged with the county-level SVI data.
The data wrangling process for the EJI data was more complex, so the major steps are described here. For Q3, the first step was to establish a shared GEOID format and then obtain 2021 spatial census data, using tigris::tracts for North Carolina, so that tract-level GEOIDs could be matched to their geometries. Coordinate systems were checked throughout to ensure alignment. The EJI attributes were then joined to the tract spatial features, with geometry, by the GEOID field. Some tracts did not have EJI data. Variables of interest were then selected into the newly created tracts_EJI_sf_select. The next step was to match power plant data to tracts, which was possible because LAT and LON were available for the plants. After checking coordinate systems (NAD 1983, EPSG:4269), the coordinates of the power plants in North Carolina were matched to tract geometry via st_intersection with tracts_EJI_sf_select, producing intersection_plants_EJI.
Generation and emissions were at that point available only at the plant level. To obtain tract-level values, intersection_plants_EJI was grouped by GEOID and summarized so that variables of interest were summed from plant level to tract level. For example, the new column “Tract_Generation” was formed by summing power plant generation over all plants in each tract. With the tract-level EJI variables of interest attached, analyses of relationships between emissions, generation, and tract-level EJI data could finally be run. Working at the tract level was important because the EJI data documentation discourages adding up tract-level values to get to the county level. Having all information at a common spatial resolution (being able to identify which tract a power plant was in) was a crucial foundation for running analyses. Lastly, NAs were dropped for variables of interest.
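To make the plant-to-tract matching concrete, here is a minimal sketch of the underlying idea. It is not the sf `st_intersection` call used in the analysis; it swaps in a plain ray-casting point-in-polygon test on planar coordinates, ignores polygon holes and multi-part tracts, and uses hypothetical GEOIDs, plant IDs, and unit-square "tracts".

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is point (x, y) inside the polygon given by `ring`?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical tract outlines keyed by GEOID (same coordinate system as the plants).
tracts = {
    "37001000100": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "37001000200": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

# Hypothetical plants with LON/LAT columns, as in the plant data.
plants = [
    {"ORISPL": 1001, "LON": 0.5, "LAT": 0.5},
    {"ORISPL": 1002, "LON": 1.5, "LAT": 0.25},
    {"ORISPL": 1003, "LON": 3.0, "LAT": 3.0},  # falls outside every tract
]


def tract_for(plant):
    """Return the GEOID of the first tract containing the plant, or None."""
    for geoid, ring in tracts.items():
        if point_in_polygon(plant["LON"], plant["LAT"], ring):
            return geoid
    return None


matches = {p["ORISPL"]: tract_for(p) for p in plants}
print(matches)
```

A geospatial library also handles coordinate reference systems and complex geometries, which is why the coordinate system check described above comes before the intersection.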
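The plant-to-tract aggregation can be sketched as below. The rows, GEOIDs, and values are made up, and `RPL_EBM` is assumed as the name of the environmental burden percentile rank; `Tract_Generation` is the column name used in the text, while `Tract_CO2` is a hypothetical companion column.

```python
from collections import defaultdict

# Made-up plant rows after matching to tracts (one row per plant).
intersection_plants = [
    {"GEOID": "37001000100", "generation": 120.0, "co2": 50.0},
    {"GEOID": "37001000100", "generation": 30.0, "co2": 10.0},
    {"GEOID": "37001000200", "generation": 500.0, "co2": 400.0},
]

# Made-up tract-level EJI values; None stands in for a missing value (NA).
eji_by_tract = {
    "37001000100": {"RPL_EBM": 0.5},
    "37001000200": {"RPL_EBM": None},
}

# Group by GEOID and sum plant-level values up to the tract.
totals = defaultdict(lambda: {"Tract_Generation": 0.0, "Tract_CO2": 0.0})
for row in intersection_plants:
    totals[row["GEOID"]]["Tract_Generation"] += row["generation"]
    totals[row["GEOID"]]["Tract_CO2"] += row["co2"]

# Attach the EJI variable and drop tracts where it is missing.
tract_level = [
    {"GEOID": geoid, **sums, **eji_by_tract[geoid]}
    for geoid, sums in sorted(totals.items())
    if eji_by_tract.get(geoid, {}).get("RPL_EBM") is not None
]

print(tract_level)
```

Summing moves plant values up to the tract, while the EJI percentile ranks are left at their native tract resolution rather than being combined across tracts.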
This analysis begins with data exploration and visualization of the processed_data dataframe, focusing on fuel types, income, and electricity generation in North Carolina. Figure 4.1 is a pie chart showing the percentage makeup of primary fuel types in North Carolina based on total annual generation. The scatter plot in Figure 4.2 depicts the relationship between median household income and total plant annual generation, with points colored by fuel type. The exploratory analysis then drills down to scatter plots that exclude nuclear data (Figure 4.3) and that cover a selected set of fuel types: COAL, SOLAR, OIL, WIND, and GAS (Figure 4.4). Additionally, stacked bar plots in Figures 4.5 and 4.6 visualize total annual generation by fuel type and county; for clarity, separate plots are created for the top 10 and bottom 10 counties by median household income. These visualizations help explore the relationships between fuel types, income levels, and electricity generation in different counties of North Carolina.
Figure 4.1: Pie chart illustrating the percentage distribution of primary fuel types in North Carolina based on total annual generation.
Figure 4.2: Scatter plot illustrating the relationship between median household income and total annual generation, color-coded by fuel type. Because nuclear power plants operate at significantly higher capacity factors than other sources, the plot lacks a discernible relationship, prompting the adjustments that follow.
Figure 4.3: Exploring energy patterns, this scatter plot analyzes the connection between median household income and total annual generation, excluding nuclear energy. The updated chart reveals a potential trend, especially around the $70,000 median income mark, where data points for generation are sparse.
Figure 4.4: To narrow our exploration further, we selected key energy sources: coal, gas, oil, wind, and solar.
Figure 4.5: Stacked bar plot illustrating the total annual generation by fuel type in the top 10 counties (by median income) in North Carolina.
Figure 4.6: To get a balanced view of the total annual generation by fuel type and county, we then show the bottom 10 counties (by median income) in North Carolina. This identified Richmond County as a county of interest.
We then created initial maps to better understand the landscape of North Carolina in terms of socioeconomic status and the geographic distribution of power plants. Figure 4.7 below maps power plant locations and helped us visualize where plants are concentrated across the state. Figure 4.8 maps the nameplate capacity of those power plants by county. We noticed that several counties, especially Richmond County, stood out as having a high concentration of capacity. Figure 4.9 is a heat map of the percentage of each county’s population living in poverty as of 2021. Richmond County is once again a clear visual outlier on this map.
Figure 4.7: Locations of active power plants across North Carolina as of 2021. Each black dot represents a power plant.
Figure 4.8: This map shows the total nameplate capacity, in megawatts, of each county in North Carolina. Counties with no power plants are left blank. Richmond County is an interesting outlier, with the highest capacity of all the counties.
Figure 4.9: Percent of total county population living in poverty in 2021, as defined by the USDA Economic Research Service.
As a final stage of our exploratory analysis, we explored power plant emissions geographically. Figures 4.10, 4.11, 4.12, and 4.13 show the aggregated power plant emissions per county in 2021 for CO2, NOx, SO2, and CH4, respectively. Unfortunately, eGRID did not provide data for mercury (Hg) in 2021, so it was not analyzed. CH4 is provided in lbs, while CO2, NOx, and SO2 are in short tons. For CO2 emissions, Richmond County stood out. For NOx emissions, Catawba County stood out. For SO2 emissions, Catawba, Haywood, and Person counties stood out. For CH4 emissions, Catawba and Person counties stood out.
Figure 4.10: Total CO2 Emissions in 2021 across North Carolina Power Plants.
Figure 4.11: Total NOx Emissions in 2021 across North Carolina Power Plants.
Figure 4.12: Total SO2 Emissions in 2021 across North Carolina Power Plants.
Figure 4.13: Total CH4 Emissions in 2021 across North Carolina Power Plants.
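Because CH4 is reported in lbs while the other pollutants are in short tons, the maps are not on a common scale. A small conversion like the one below (1 short ton = 2,000 lb) could put them on one; the county labels and values here are placeholders, not eGRID figures.

```python
LB_PER_SHORT_TON = 2000  # definition of the US short ton


def lbs_to_short_tons(pounds):
    """Convert a mass in pounds to short tons."""
    return pounds / LB_PER_SHORT_TON


# Placeholder county totals of CH4 in lbs.
county_ch4_lbs = {"County A": 5000.0, "County B": 250.0}

county_ch4_short_tons = {
    county: lbs_to_short_tons(lbs) for county, lbs in county_ch4_lbs.items()
}

print(county_ch4_short_tons)
```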
As exploratory work into the dimensions of social vulnerability, environmental burden, and health vulnerability in relation to generation and emissions across power plants in North Carolina, tests for normality were conducted on the overall index value for each of the aggregated themes (SVI, EBM, HVM). The distribution of the variables corresponding to these three overall themes of interest can be seen in Figures 4.14, 4.15, and 4.16.
Figure 4.14: Test for normality in the distribution of Social Vulnerability Module percentile ranks. The normal Q-Q plot indicates that the distribution is not fully normal. The distribution is skewed such that most observations fall in the second half.
Figure 4.15: Test for normality in the distribution of Environmental Burden Module percentile ranks. The normal Q-Q plot indicates that the distribution is not fully normal. The distribution is skewed such that most observations fall in the second half.
Figure 4.16: Test for normality in the distribution of Health Vulnerability Module percentile ranks. The distribution shows that HVM (as a percentile rank) is ordered but not continuous, unlike EBM. EJI obtains its data on indicators from a variety of sources. RPL_HVM, specifically, can take a value of 0, 0.2, 0.4, 0.6, 0.8, or 1.0.
As the last set of geographic explorations, two maps were created. The first is a map of NC power plants overlaid on tract boundaries, used to confirm that the matching of tracts to power plants based on spatial features for the Q3 analyses was successful (see Figure 4.17). The second maps asthma prevalence across NC tracts as a visual grounding point for the Q3 analyses, given that the emissions considered in this project are known to have respiratory effects in humans (see Figure 4.18).
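For readers unfamiliar with normal Q-Q plots, the sketch below shows how the plotted points can be computed: each sorted observation is paired with the standard normal quantile at its plotting position. The plotting position (i + 0.5)/n is one common choice and may differ from what the plotting software behind these figures uses; the sample is made up to mimic the six discrete levels that RPL_HVM can take.

```python
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair each sorted observation with its theoretical standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i + 0.5) / n), value)
        for i, value in enumerate(ordered)
    ]


# Made-up sample restricted to the six levels HVM percentile ranks can take.
hvm_like = [0.0, 0.2, 0.2, 0.4, 0.6, 0.6, 0.6, 0.8, 1.0, 1.0]

points = normal_qq_points(hvm_like)
distinct_levels = sorted({value for _, value in points})

for theoretical, observed in points:
    print(f"{theoretical:7.3f}  {observed:.1f}")
```

Normally distributed data would fall close to a straight line. A variable with only a few distinct levels produces horizontal steps instead, which is one way to see that it is ordered but not continuous.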
2.0.4 Social Vulnerability Index
2020 data from the U.S. CDC on key vulnerability criteria including socioeconomic status, household characteristics, racial & ethnic minority status, and housing type & transportation. There are four theme variables that contribute to the Overall Vulnerability index value: Socioeconomic Status (RPL_THEME1), Household Characteristics (RPL_THEME2), Racial & Ethnic Minority Status (RPL_THEME3), and Housing Type & Transportation (RPL_THEME4). The variables that contribute to each of these themes are estimated using the American Community Survey (ACS), 2016-2020 (5-year data). U.S. tracts are ranked based on percentiles. Percentile ranking values range from 0 to 1, with higher values indicating greater vulnerability. Percentile ranks are available for the 16 individual variables that feed into the four themes, for the four themes themselves, and for the overall vulnerability position.
Initially, this dataset was chosen because it contains socioeconomic information at the county and tract level that was not available for 2020 in the other sources we reviewed. Once we began to explore the dataset and learned that it contained variables we had not previously known were publicly available at this resolution, we chose to continue working with it beyond the unemployment and income information we were initially seeking, and expanded our scope to consider relationships across the broader set of variables it uses.
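As an illustration of how a 0-to-1 percentile rank can be computed, the sketch below uses the common (rank − 1)/(N − 1) definition, giving tied values the lowest rank in their group. This is an assumption for illustration: the SVI documentation should be consulted for the exact treatment of ties and missing values, and the input values are made up.

```python
def percentile_ranks(values):
    """Percentile rank in [0, 1] for each value: (rank - 1) / (N - 1), ties share the lowest rank."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to rank")
    first_rank = {}
    for position, value in enumerate(sorted(values), start=1):
        first_rank.setdefault(value, position)  # keep the first (lowest) rank for ties
    return [(first_rank[value] - 1) / (n - 1) for value in values]


# Made-up indicator values for five tracts (for example, a percentage of households).
indicator = [12.0, 30.0, 18.0, 30.0, 5.0]

ranks = percentile_ranks(indicator)
print(ranks)
```

Under this definition the lowest value always receives 0, and the highest receives 1 when it is not tied, matching the stated range of the SVI percentile ranks.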